This web site provides additional material for the article "Here is the Data. Where is its Schema?", submitted to the 24th International WWW Conference 2015 as submission #312.
Below you will find the detailed hierarchies, ground-truth class distributions, and MDL evolution for the hierarchies discovered by SQuaScheD on all datasets mentioned in the paper.
The interactive visualizations below show, for each dataset, the most representative attributes and entities of each class in the hierarchy discovered by SQuaScheD.
The divergence between the SQuaScheD results and the ground truth is mainly due to two factors. First, some leaf classes in the ground truth can themselves be divided further into subclasses. For instance, election can be divided into state election and general election; Event can be divided into events in different locations, such as the US, Korea, China, etc. Second, there are meta attributes in the data that may mislead the discovery process. For example, some entities are associated with images and others are not; because an image has multiple attributes, such as image_size and url, it may force SQuaScheD to divide the entities into a class with images and a class without images.
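The effect of meta attributes can be illustrated with a small sketch. It is not the SQuaScheD algorithm: it uses plain Jaccard similarity over attribute sets as a stand-in, and every attribute name other than image_size and url (as well as the helper names) is a hypothetical chosen for illustration. It only shows why a bundle of image attributes can make two entities from different classes look more alike than two entities from the same class.

```python
# Illustrative sketch only: Jaccard similarity stands in for whatever
# measure the discovery process actually uses (an assumption).

def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    return len(a & b) / len(a | b)

# Meta attributes that come bundled with an image.
# image_size and url appear in the text; image_format is hypothetical.
IMAGE_ATTRS = {"image_size", "url", "image_format"}

# Hypothetical entities, described only by the attributes they carry.
election_with_image = {"date", "winner"} | IMAGE_ATTRS
election_without_image = {"date", "winner"}
event_with_image = {"location", "organizer"} | IMAGE_ATTRS

def strip_meta(attrs):
    """Drop the image-related meta attributes from an attribute set."""
    return attrs - IMAGE_ATTRS

# Same ground-truth class, but only one entity has an image.
within_class = jaccard(election_with_image, election_without_image)
# Different ground-truth classes, but both entities have an image.
across_class = jaccard(election_with_image, event_with_image)

# Same comparisons after removing the meta attributes.
within_clean = jaccard(strip_meta(election_with_image),
                       strip_meta(election_without_image))
across_clean = jaccard(strip_meta(election_with_image),
                       strip_meta(event_with_image))

print("with meta attributes:   ", within_class, across_class)
print("without meta attributes:", within_clean, across_clean)
```

With the image attributes present, the cross-class pair scores higher than the same-class pair, so a similarity-driven split would tend to separate entities by image presence first. Once the meta attributes are removed, the same-class pair becomes identical and the cross-class pair shares nothing.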
Distribution of the bottom-most ground-truth class in the discovered class hierarchy for all datasets.